AdaptDB: Adaptive Partitioning for Distributed Joins

Authors

  • Yi Lu
  • Anil Shanbhag
  • Alekh Jindal
  • Samuel Madden
Abstract

Big data analytics often involves complex join queries over two or more tables. Such join processing is expensive in a distributed setting, both because large amounts of data must be read from disk and because data must be shuffled across the network. Many techniques based on data partitioning have been proposed to reduce the amount of data that must be accessed, often focusing on finding the best partitioning scheme for a particular workload rather than adapting to changes in the workload over time. In this paper, we present AdaptDB, an adaptive storage manager for analytical database workloads in a distributed setting. It works by partitioning datasets across a cluster and incrementally refining the data partitioning as queries are run. AdaptDB introduces a novel hyper-join that avoids expensive data shuffling by identifying storage blocks of the joining tables that overlap on the join attribute and joining only those blocks. Hyper-join performs well when each block in one table overlaps with few blocks in the other table, since that minimizes the number of blocks that have to be accessed. To minimize the number of overlapping blocks for common join queries, AdaptDB uses smooth repartitioning to repartition small portions of the tables on join attributes as queries run. A prototype of AdaptDB running on top of Spark improves query performance by 2-3x on TPC-H as well as on a real-world dataset, versus a system that employs scans and shuffle-joins.
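The block-overlap idea behind hyper-join can be illustrated with a small sketch. This is not AdaptDB's implementation: the `Block` type, the per-block `lo`/`hi` key ranges, and the function names are all assumptions made for illustration. It only shows that, when each block records the range of join-key values it holds, the join needs to visit just the block pairs whose ranges intersect.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical storage block: a join-key range plus its rows."""
    block_id: str
    lo: int  # smallest join-key value stored in the block
    hi: int  # largest join-key value stored in the block
    rows: Tuple[Tuple[int, str], ...]  # (join_key, payload) pairs


def overlaps(a: Block, b: Block) -> bool:
    # Two closed ranges intersect when each starts before the other ends.
    return a.lo <= b.hi and b.lo <= a.hi


def overlapping_pairs(left: List[Block], right: List[Block]) -> List[Tuple[Block, Block]]:
    # Only these block pairs can contain matching join keys.
    return [(l, r) for l in left for r in right if overlaps(l, r)]


def hyper_join_sketch(left: List[Block], right: List[Block]) -> List[Tuple[int, str, str]]:
    results: List[Tuple[int, str, str]] = []
    for l, r in overlapping_pairs(left, right):
        # Plain hash join restricted to one pair of overlapping blocks.
        index: Dict[int, List[str]] = {}
        for key, payload in l.rows:
            index.setdefault(key, []).append(payload)
        for key, payload in r.rows:
            for left_payload in index.get(key, []):
                results.append((key, left_payload, payload))
    return results


# Made-up example data: two left blocks and three right blocks.
left_blocks = [
    Block("L0", 0, 9, ((1, "a"), (7, "b"))),
    Block("L1", 10, 19, ((12, "c"),)),
]
right_blocks = [
    Block("R0", 0, 4, ((1, "x"),)),
    Block("R1", 5, 14, ((7, "y"), (12, "z"))),
    Block("R2", 20, 29, ((25, "w"),)),
]

pairs = overlapping_pairs(left_blocks, right_blocks)
joined = hyper_join_sketch(left_blocks, right_blocks)
```

In this toy data only three of the six possible block pairs overlap, so block R2 is never read. The fewer blocks each block overlaps with, the less data the join touches, which is the property the abstract says smooth repartitioning is meant to improve.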

Related papers

An Optimization Technique for Spatial Compound Joins Based on a Topological Relationship Query and Buffering Analysis in DSDBs with Partitioning Fragmentation

Spatial Partitioning Fragmentation (SPF) is a popular method to partition data in Distributed Spatial Databases (DSDBs). However, cross-border queries are an inherent problem for distributed spatial data queries based on partitioning fragmentation, given the continuity and strong correlation of geospatial data. In the case of partitioning fragmentation, a global spatial join can be tra...

A Forward Scan based Plane Sweep Algorithm for Parallel Interval Joins

The interval join is a basic operation that finds application in temporal, spatial, and uncertain databases. Although a number of centralized and distributed algorithms have been proposed for the efficient evaluation of interval joins, classic plane sweep approaches have not been considered at their full potential. A recent piece of related work proposes an optimized approach based on plane swe...

Locality-Adaptive Parallel Hash Joins Using Hardware Transactional Memory

Previous work [1] has claimed that the best performing implementation of in-memory hash joins is based on (radix-)partitioning of the build-side input. Indeed, despite the overhead of partitioning, the benefits from increased cache locality and synchronization-free parallelism in the build phase outweigh the costs when the input data is randomly ordered. However, many datasets already exhibit s...

Squall: Scalable Real-time Analytics using Efficient, Skew-resilient Join Operators

Squall is a scalable online query engine that runs complex analytics in a cluster using skew-resilient, adaptive operators. Online processing implies that results are incrementally built as the input arrives, and it is ubiquitous in many applications such as algorithmic trading, clickstream analysis and business intelligence (e.g., in order to reach a potential customer during the active sessio...

I-Store: Data Management for Fast Networks

Motivation: Existing distributed data management systems typically assume that the network is a major bottleneck [10]. Consequently, avoiding remote data transfers is an important design aspect of existing systems. In extreme cases, this has led to system designs that explicitly do not support certain distributed operations (e.g., BigTable only supports joins if the inner table contains less...

Journal:
  • PVLDB

Volume 10  Issue 

Pages  -

Publication date 2017